Abstract

The task of suggesting recipients for an email has recently received
attention as it has potential to enhance the flow of knowledge and
information within an organization or social network. We investigate two
transfer learning techniques to improve recipient prediction performance
through considering predictions for multiple users. We present a novel
continuous hidden variable conditional random field for the recipient
prediction problem. We characterize this construction as a type of
discriminative author-recipient-topic (DART) model. First, we show
transfer-based performance increases achieved through shared hidden
variables for prediction across different users. Second, we show how
transfer from an organization-wide model to a user-specific model through
parameter prior structure also confers substantial advantage, especially
when models are constructed for new users.

1 Introduction

The problem of CC prediction was introduced in [10] along with a number
of generative probabilistic models for solving the problem. Here we use
discriminative methods for the general problem of recipient prediction
and focus upon an exploration of two transfer learning techniques. We
show that it is possible to leverage the information contained within the
related prediction tasks for different users and increase overall
prediction performance. These methods also show great potential for
increasing the performance of models for new users.

There has been growing interest in the exploration of transfer learning
methods within the Machine Learning community [3, 6, 1, 5, 11, 9]. Early
work in [3] describes multitask learning as an approach to exploit
information used for the training of other tasks to improve a given task.
Other recent work explores multitask learning based on the minimization
of regularization functionals in the context of Support Vector Machine
(SVM) based approaches [5]. In contrast, the first aspect of our
exploration here uses a discriminative low dimensional latent variable
representation to make predictions for different senders across an
organization. Our approach is thus related to [6], which sought to obtain
a low dimensional latent space suitable for multiple image classification
tasks. The second component of our exploration uses a graphical model
over parameter priors to transfer information from an organization-wide
model for predictions to user-specific models. Recently, [11] also
explored transfer learning using more sophisticated parameter priors but
for simpler discriminative models.

2 Recipient Prediction with DARTs

Email recipient prediction in the context of a recommender system for
users in a social network or organization is a challenging problem for a
number of reasons, including: (1) there are possibly hundreds or even
thousands of possible recipients; (2) the true number of reasonable
potential recipients is typically unknown; and (3) while some suggestions
may indeed be reasonable, unless extensive analysis and hand labeling is
used for augmenting labels, these suggestions may be flagged as
incorrect. While these issues are important to consider, augmentations to
recipient lists can be subjective, and therefore we evaluate suggestions
based on a test subset of emails and their observed recipients.

Topic models have also received substantial recent attention in the
Machine Learning community. A variety of methods have emerged as
alternatives to the original approach to Latent Semantic Analysis (LSA)
[4] or direct Principal Component Analysis (PCA) [7] of a term document
matrix. Latent Dirichlet Allocation (LDA) [2] is widely regarded as a
state-of-the-art topic modeling method when one wishes to use only the
words of a document to obtain topics. Recently, McCallum et al. [8]
extended the basic LDA approach to include explicit Author, Recipient and
Topic variables; we shall refer to this approach as ART models. These
models are very effective at extracting meaningful topics and have been
shown to reveal user roles in social networks. Despite these attractive
attributes, our experiments with these types of ART models for the task
of recipient prediction have produced performance far below the baseline
method discussed in Section 3. However, as discussed in Section 1, hidden
variables and sophisticated prior structures can be used to implement a
variety of transfer learning approaches. These factors motivate our
development of the following Discriminative Author Recipient Topic (DART)
models and two approaches to transfer learning with DARTs.


Figure 1: Our DART model (left) vs. the directed, discrete ART model of [8] (right).


In the following exposition we present a continuous hidden variable,
random field based DART model. Our model can be characterized as a type
of discriminative Boltzmann machine, or as a rich discriminative
multinomial generalization of probabilistic PCA [12]. We use the random
variables and notation shown in Figure 1. Figure 1 (left) illustrates our
model as a plated random field and contrasts our DART model with the ART
model of [8] (right). Our DART model encodes the conditional probability
of recipients and hidden topics given words and the email author as:
where we have coupled the random variables by the connection matrices
$M^w$, $M^r$ and $M^a$. Z is an intractable normalization constant. For
notational convenience, we set $M^w_{iV} = 0$, $M^r_{iR} = 0$ and
$M^a_{iA} = 0$ for $i = 1, \dots, T$. Integrating out the uncertainty of
hidden topics, we seek to optimize the marginal conditional likelihood of
recipients given words and authors for the corpus,
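
Written generically, the objective described in this paragraph has the following shape. This is a sketch under assumed notation, not the paper's own expression: bold $\mathbf{r}_d$, $\mathbf{w}_d$ and $\mathbf{t}_d$ stand for the recipients, words and hidden topics of email $d$, $a_d$ is its author, and any parameter prior terms are omitted.

```latex
\mathcal{L}(M^w, M^r, M^a)
  = \sum_{d} \log P(\mathbf{r}_d \mid \mathbf{w}_d, a_d)
  = \sum_{d} \log \int P(\mathbf{r}_d, \mathbf{t}_d \mid \mathbf{w}_d, a_d)\, d\mathbf{t}_d
```

The integral over $\mathbf{t}_d$ is what "integrating out the uncertainty of hidden topics" refers to; the intractable constant Z makes this quantity hard to evaluate exactly, which is why sampling is used during learning.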
SYMBOL   DESCRIPTION
T        number of topics
N_d      number of emails
V        number of unique words
R        number of recipients
A        number of authors
N_w      number of word tokens in email d
s_d      number of recipients on email d
M^w      T x (V - 1) word connection matrix
M^r      T x (R - 1) recipient connection matrix
M^a      T x (A - 1) author connection matrix
t_di     the ith topic of email d
w_dj     the jth word of email d
r_dk     the kth recipient of email d
a_d      the author of email d

Table 1: Notation used in this paper


To perform learning with our DART model we take the gradient of the
conditional log likelihood and arrive at the following update rules:

where  or equivalently $m_{dk}$, for d = 1, ..., D and k = 1, ..., V - 1,
denote draws from a Gibbs sampler, and I(q ∈ Q) and I(a = b) are
indicator functions. Under the DART model here, the conditional
distributions required for the sampler are either log-normal for t or
multinomial for the other variables. As with more tractable conditional
random fields, these updates have an intuitive interpretation as
consisting of differences between expectations involving the empirical or
data distribution and expectations based on (approximated) model
distributions. The final terms in these updates arise from the use of a
zero mean Gaussian prior for parameters.
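
The interpretation given above can be made concrete with the standard gradient of a log-linear conditional model with hidden variables. This is a generic sketch rather than the paper's update rules: it assumes the model is log-linear in a parameter $\theta$ with feature function $f$, and that the zero mean Gaussian prior has variance $\sigma^2$; both $f$ and $\sigma^2$ are names introduced here for illustration.

```latex
\frac{\partial \mathcal{L}}{\partial \theta}
  = \sum_{d} \Big(
      \mathbb{E}_{P(\mathbf{t} \mid \mathbf{r}_d, \mathbf{w}_d, a_d)}
        \big[ f(\mathbf{r}_d, \mathbf{t}, \mathbf{w}_d, a_d) \big]
    - \mathbb{E}_{P(\mathbf{r}, \mathbf{t} \mid \mathbf{w}_d, a_d)}
        \big[ f(\mathbf{r}, \mathbf{t}, \mathbf{w}_d, a_d) \big]
    \Big)
  - \frac{\theta}{\sigma^2}
```

The first expectation clamps the observed recipients (the data side), the second lets the model generate recipients and topics given only words and author (the model side, approximated with Gibbs samples), and the last term is the contribution of the prior.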


Figure 2: Transfer learning using information from an organization scale
model (left) to user specific models.


3 Results and Conclusions

To provide a straightforward, intuitive evaluation we use a mean
reciprocal rank (MRR) metric. In information retrieval the reciprocal
rank of a test document is the reciprocal of the rank at which the first
relevant response was returned, or 0 if none of the responses contained a
relevant answer. The score for a sequence of queries is the mean of the
individual queries' reciprocal ranks.
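
The MRR definition above can be sketched in a few lines of Python. The function names and the toy recipients are illustrative assumptions, not taken from the paper or the Enron data.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item in `ranked`, or 0.0 if none."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranked_list, relevant_set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up recipients (hypothetical data).
toy_queries = [
    (["alice", "bob", "carol"], {"bob"}),    # first hit at rank 2 -> 1/2
    (["dave", "erin"], {"dave", "erin"}),    # first hit at rank 1 -> 1
    (["frank", "grace"], {"heidi"}),         # no hit -> 0
]
toy_mrr = mean_reciprocal_rank(toy_queries)  # (0.5 + 1.0 + 0.0) / 3
```

In the recipient prediction setting, each query is a test email, the ranked list is the model's ordering of candidate recipients, and the relevant set is the email's observed recipients.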

For our experiments we use two approaches to make recipient predictions.
In our first experiment, we use the model itself to make recipient
predictions by computing the mean of the hidden variable distribution
given observed author and word features. We then compute the multinomial
distribution obtained when conditioning upon this value. The ordering of
recipients produced by this distribution is then used to determine the
MRR by finding the first predicted recipient also on the list of test
email recipients. Our second experiment uses an approach based on
computing the cosine similarity within the latent space for a given test
email and all training set emails. Predictions are then obtained from
recipients of the retrieved documents. Finally, our baseline method is a
term frequency, inverse document frequency (TFIDF) based cosine
similarity computation using the original document vectors and the same
MRR computation as the latent space method.
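
A minimal sketch of a TFIDF cosine similarity baseline of the kind described above follows. Several details are assumptions made for illustration, since the text does not specify them: the IDF smoothing (log(N/df) + 1), the rule for turning retrieved documents into a recipient ranking (summing each training email's similarity onto its recipients), and all function names and toy data.

```python
import math
from collections import Counter, defaultdict


def fit_idf(train_docs):
    """Inverse document frequency per term (smoothing choice is an assumption)."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Sparse TFIDF vector for a tokenized document; unseen terms are dropped."""
    return {term: count * idf[term] for term, count in Counter(doc).items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_recipients(test_doc, train_docs, train_recipients):
    """Rank recipients by summed similarity of the training emails they received."""
    idf = fit_idf(train_docs)
    query = tfidf(test_doc, idf)
    scores = defaultdict(float)
    for doc, recipients in zip(train_docs, train_recipients):
        similarity = cosine(query, tfidf(doc, idf))
        for recipient in recipients:
            scores[recipient] += similarity
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical toy corpus: tokenized emails and their recipients.
train_docs = [["gas", "pipeline", "contract"], ["lunch", "friday"], ["pipeline", "schedule"]]
train_recipients = [["alice"], ["bob"], ["alice", "carol"]]
ranking = rank_recipients(["pipeline", "contract"], train_docs, train_recipients)
```

The same retrieval and ranking step applies to the latent space method if the TFIDF vectors are replaced by latent representations of the emails; the resulting ranking can then be scored with MRR against the observed recipients.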

We use the Enron email corpus with the processing described in [8]. The
resulting corpus consists of 23,488 email messages sent among 147 users.
Emails that were not received by at least one of the 147 users are not
included. In order to capture only the new text entered by the author of
a message, "quoted original messages" in replies were removed using
heuristic methods. To remove sensitivity to capitalization, all text was
lowercased. Finally, we randomly partition the data set into a 90%
training set and a 10% test set for our experiments and use 200 topics.


For our first experiment we use the shared hidden variable structure of
our DART model to learn a latent space model for predictions across all
users in our training set. From Figure 3 (top) we see that shared
variable transfer learning increases MRR across the entire Enron corpus
by 10% over the non-transfer TFIDF baseline. In our second experiment we
investigate parameter prior based transfer learning as illustrated in
Figure 2. We first learn an organization-wide model and then achieve
transfer by using the parameters of this model as a non-zero mean
Gaussian parameter prior for a model specifically trained for user 50.
Figure 3 (bottom) compares a user-specific model trained with transfer
with a user-specific model trained using a zero mean Gaussian prior. The
TFIDF baseline is also given. From this analysis we see that the most
dramatic benefits of transfer occur early in learning, but benefits
persist after many iterations of gradient based learning.
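
The effect of swapping a zero mean prior for a prior centered on the organization-wide parameters can be illustrated with a small sketch. It is not the DART training procedure: the function name, step size, prior variance and parameter values are all hypothetical, and the likelihood gradient is set to zero to isolate the contribution of the prior.

```python
def map_gradient_step(theta, loglik_grad, prior_mean, prior_var, step_size):
    """One gradient-ascent step on log-likelihood plus a Gaussian log-prior.

    The Gaussian prior N(prior_mean, prior_var) contributes the gradient
    -(theta - prior_mean) / prior_var, which pulls theta toward prior_mean.
    """
    return [
        t + step_size * (g - (t - m) / prior_var)
        for t, g, m in zip(theta, loglik_grad, prior_mean)
    ]


# Hypothetical organization-wide parameters used as the prior mean.
org_params = [0.8, -0.3]
zero_mean = [0.0, 0.0]
no_data_grad = [0.0, 0.0]  # stand-in for a user with no local evidence yet

theta_transfer = [0.0, 0.0]
theta_plain = [0.0, 0.0]
for _ in range(20):
    theta_transfer = map_gradient_step(theta_transfer, no_data_grad, org_params, 1.0, 0.5)
    theta_plain = map_gradient_step(theta_plain, no_data_grad, zero_mean, 1.0, 0.5)
```

With no local evidence, the transfer variant converges toward the organization-wide parameters while the zero mean variant stays at zero. This mirrors the intuition in the text that transfer helps most early in learning or when a user has little data; as the likelihood gradient grows with more user-specific emails, it increasingly dominates the prior term.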


Figure 3: Transfer learning using shared hidden variables (top) and
parameter priors (bottom).

In conclusion, we have shown how two types of transfer learning using
DARTs confer significant advantages. Both shared hidden variables and
prior based methods improve recipient prediction performance. In the
latter case our experiments involved a user with over 2000 emails in
their local training set. Early iterations in learning are analogous to
situations where fewer training examples are available. We thus expect
that transfer learning methods using these approaches could be most
beneficial to models for new users.